Summary

We use the Ridge algorithm to build a regression model that predicts the popularity of Spotify tracks from features such as danceability, loudness, and tempo. The popularity score ranges from 0 to 100, where 0 means a song has minimal popularity and 100 means it is extremely popular.

Introduction

Some songs sit atop popularity charts such as the Billboard charts, while others sit comfortably at the bottom, and some do not chart at all. This poses an interesting question: what exactly makes a song popular, and can we predict how popular a song will become based on certain features? This report is an attempt to answer that question by predicting the popularity of a song from its features.

According to this report, approximately 137 million new songs are released every year, yet only about 14 records in history have sold 15 million or more physical copies globally. It is therefore worthwhile to determine what drives a track's popularity and, specifically, to predict how popular a song will become from features such as danceability, loudness, and tempo.

Methods

Data

The dataset used in this project was sourced from Tidy Tuesday's GitHub repo here, and particularly here. The data, however, originally come from Data.World, Billboard.com, and Spotify. Each row of the dataset contains a song's features along with a target column specifying the song's popularity on a scale of 0 (least popular) to 100 (most popular).

Analysis

A Ridge model was built to answer our research question: predicting the popularity of Spotify tracks. This is a regression problem, and predictions range from 0 (least popular) to 100 (most popular). All features in the original dataset were used to fit the model, with the exception of the 'song_id,' 'spotify_track_id,' and 'spotify_track_album' columns. A 10-fold cross-validation was used for hyperparameter optimization. The code used to perform this analysis can be found here.
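The setup above can be sketched as a scikit-learn pipeline. This is a minimal illustration on synthetic data, not the project's actual code: the column names and the standard-scaling step are assumptions, and the real dataset would be loaded from the Tidy Tuesday source instead.

```python
# Sketch of the modelling setup: drop identifier columns, scale numeric
# features, and fit a Ridge regressor. Synthetic data stands in for the
# real Spotify dataset; column names are assumptions.
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "song_id": [f"id{i}" for i in range(n)],  # identifier, dropped before fitting
    "danceability": rng.uniform(0, 1, n),
    "loudness": rng.uniform(-30, 0, n),
    "tempo": rng.uniform(60, 200, n),
})
# Synthetic popularity target on the 0-100 scale described above.
df["popularity"] = (
    50 * df["danceability"] + df["tempo"] / 4 + rng.normal(0, 5, n)
).clip(0, 100)

X = df.drop(columns=["song_id", "popularity"])
y = df["popularity"]

numeric_features = ["danceability", "loudness", "tempo"]
preprocessor = make_column_transformer((StandardScaler(), numeric_features))
pipe = make_pipeline(preprocessor, Ridge(alpha=1.0))
pipe.fit(X, y)
print(round(pipe.score(X, y), 3))
```

The identifier columns carry no predictive signal, which is why they are excluded before fitting.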

Results & Discussion

It is important to examine how the features are correlated and what their pairwise distributions look like. In Figure 1, the blue plots (with fitted lines) show the pairwise distributions of the features, while the remaining panels show the pairwise correlations. The correlations are moderate, so the features can reasonably be used together to build the Ridge model that addresses our predictive question.

Figure 1. Pairwise distributions and correlations of all features

We adopted a regularized linear regression model, the Ridge algorithm. We chose Ridge because its regularization mitigates the multicollinearity problem among the features. A 10-fold cross-validation was carried out, and the train and validation \(R^2\) scores are reported in the table below.
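The cross-validation step can be sketched as follows. This is an illustrative example on synthetic regression data, not the project's code; `return_train_score=True` is what produces the train_score column alongside the test (validation) scores.

```python
# Minimal sketch of 10-fold cross-validation for a Ridge model, reporting
# train and validation R^2 scores (synthetic data, illustrative only).
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
cv_results = cross_validate(
    Ridge(alpha=1.0), X, y, cv=10, return_train_score=True
)
# test_score here is the validation R^2 on each held-out fold.
print(cv_results["train_score"].mean(), cv_results["test_score"].mean())
```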

Table 1. Train and validation scores from cross-validation
| fit_time (s) | score_time (s) | test_score (\(R^2\)) | train_score (\(R^2\)) |
|-------------:|---------------:|---------------------:|----------------------:|
| 5.1090078 | 0.0688303 | 0.4782244 | 0.7925249 |
| 7.1154528 | 0.0713170 | 0.4743461 | 0.7916930 |
| 7.1507609 | 0.0694790 | 0.5038553 | 0.7894538 |
| 5.1150637 | 0.0818172 | 0.4764264 | 0.7929407 |
| 0.8690741 | 0.0628850 | 0.4458901 | 0.7935001 |
| 0.8043900 | 0.0585492 | 0.4792189 | 0.7910602 |
| 0.5807800 | 0.0558510 | 0.4706912 | 0.7920377 |
| 0.5296991 | 0.0522580 | 0.4846401 | 0.7931793 |
| 0.5396821 | 0.0521779 | 0.4645201 | 0.7923253 |
| 0.5420330 | 0.0518708 | 0.5038660 | 0.7905986 |

The following table shows the results of RandomizedSearchCV used to determine the best hyperparameters for the Ridge model.
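A hedged sketch of this search step is shown below. The parameter distribution over the Ridge alpha spans the log-scale range seen in Table 2 (1e-3 to 1e+2); the data and the exact search space are illustrative assumptions, since the project's real search also tuned CountVectorizer parameters inside a column transformer.

```python
# Sketch of RandomizedSearchCV over the Ridge alpha on a log scale
# (synthetic data; the real search space is richer than this).
from scipy.stats import loguniform
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

X, y = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=1)
search = RandomizedSearchCV(
    Ridge(),
    param_distributions={"alpha": loguniform(1e-3, 1e2)},
    n_iter=10,
    cv=10,
    random_state=1,
)
search.fit(X, y)
print(search.best_params_["alpha"], round(search.best_score_, 3))
```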

Table 2. Best hyperparameters from RandomizedSearchCV
| mean_test_score | ridge alpha | countvectorizer-1 max_features | countvectorizer-1 binary | countvectorizer-2 max_features | countvectorizer-2 binary |
|----------------:|------------:|-------------------------------:|:------------------------:|-------------------------------:|:------------------------:|
| 0.4970392 | 1e+00 | 1000 | TRUE | 1000 | TRUE |
| 0.4940855 | 1e+00 | 1000 | FALSE | 1000 | TRUE |
| 0.4522361 | 1e-01 | 1000 | FALSE | 1000 | FALSE |
| 0.4408474 | 1e+02 | 1000 | TRUE | 1000 | FALSE |
| 0.4406922 | 1e-02 | 1000 | TRUE | 1000 | TRUE |
| 0.4406878 | 1e+02 | 1000 | TRUE | 1000 | TRUE |
| 0.4403347 | 1e-02 | 1000 | TRUE | 1000 | FALSE |
| 0.4385679 | 1e-03 | 1000 | TRUE | 1000 | FALSE |
| 0.4384198 | 1e-02 | 1000 | FALSE | 1000 | FALSE |
| 0.4360704 | 1e-03 | 1000 | FALSE | 1000 | FALSE |

To evaluate the performance of our model, we made predictions on held-out data and compared the predicted values with the actual values, as plotted below. The goodness of fit is reasonable and shows the viability of the Ridge model.
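This evaluation step can be sketched as follows, again on synthetic data in place of the real test split. Plotting `y_test` against `y_pred` (e.g. with a scatter plot) produces the kind of actual-vs-predicted comparison shown in the figure.

```python
# Sketch of the evaluation step: fit on a training split, predict on a
# held-out test split, and score the predictions (synthetic data).
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=400, n_features=10, noise=10.0, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2
)
model = Ridge(alpha=1.0).fit(X_train, y_train)
y_pred = model.predict(X_test)
print(round(r2_score(y_test, y_pred), 3))
```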

Figure 2. Comparison of actual vs. predicted values

To improve this model in the future so that its predictions become reliable, we will need the right combination of data. The data used here are mostly Spotify- and Billboard-based; in the future, we will look at aggregating data from other sources as well. In addition, since the Ridge model's performance was modest (validation \(R^2\) scores of roughly 0.45 to 0.50), we will explore more sophisticated feature engineering and model training.
